Generative adversarial networks (GANs) have demonstrated impressive image generation quality and semantic editing capabilities on real images, such as changing object classes, modifying attributes, or transferring styles. However, applying these GAN-based edits to a video frame by frame inevitably produces temporal flickering artifacts. We propose a simple yet effective method for temporally coherent video editing. Our core idea is to minimize the temporal photometric inconsistency by optimizing both the latent code and the pre-trained generator. We evaluate the quality of the edits across different domains and GAN inversion techniques and show favorable results against the baselines.
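A minimal sketch of the temporal photometric inconsistency this objective minimizes, assuming edited frames are given as nested lists of pixel intensities and that adjacent frames are already aligned (in practice frames would additionally be warped with estimated motion before comparison):

```python
def photometric_inconsistency(frames):
    """Mean squared difference between consecutive frames.

    Minimizing this quantity over the latent codes and generator weights
    encourages temporally coherent edits (flicker shows up as large
    frame-to-frame intensity changes).
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p_row, c_row in zip(prev, curr):
            for p, c in zip(p_row, c_row):
                total += (p - c) ** 2
                count += 1
    return total / count if count else 0.0
```

Identical consecutive frames yield zero loss; any flicker contributes quadratically.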
Videos are created to express emotions, exchange information, and share experiences. Video synthesis has intrigued researchers for a long time. Despite the rapid progress driven by advances in visual synthesis, most existing studies focus on improving the quality of individual frames and the transitions between them, while little progress has been made in generating longer videos. In this paper, we present a method based on 3D-VQGAN and transformers to generate videos with thousands of frames. Our evaluation shows that our model, trained on 16-frame video clips from standard benchmarks such as the UCF-101, Sky Time-lapse, and Taichi-HD datasets, can generate diverse, coherent, and high-quality long videos. We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio. Videos and code can be found at https://songweige.github.io/projects/tats/index.html.
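The key to exceeding the training clip length is autoregressive sampling with a sliding context window. A toy sketch, where `next_token_logits` is a hypothetical stand-in for the transformer over 3D-VQGAN codes and `window` mirrors training on short (e.g. 16-frame) clips:

```python
import math
import random

def sample_long_sequence(next_token_logits, num_tokens, window=16, seed=0):
    """Autoregressively sample a token sequence far longer than `window`.

    `next_token_logits(context) -> list[float]` maps the most recent
    `window` tokens to logits over the codebook; sampling proceeds one
    token at a time, so generation can continue indefinitely.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = next_token_logits(tokens[-window:])
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]  # softmax numerators
        total = sum(weights)
        r, acc = rng.random() * total, 0.0
        choice = len(logits) - 1  # fallback against float round-off
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                choice = i
                break
        tokens.append(choice)
    return tokens
```

Decoding the sampled codes back through the 3D-VQGAN decoder (not shown) would yield the video frames.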
Most computer vision systems assume distortion-free images as input. However, the widely used rolling-shutter (RS) image sensor suffers from geometric distortion when the camera and the object undergo motion during capture. Correcting RS distortions has been studied extensively, but most existing work relies heavily on prior assumptions about the scene or the motion. Moreover, the motion estimation step is either oversimplified or computationally inefficient due to flow warping, limiting its applicability. In this paper, we use a rolling shutter with a global reset feature (RSGR) to restore clean global-shutter (GS) videos. This feature enables us to turn the rectification problem into a deblur-like one, getting rid of inaccurate and costly explicit motion estimation. First, we build an optical system that captures paired RSGR/GS videos. Second, we develop a novel algorithm incorporating spatial and temporal designs to correct the spatially varying RSGR distortion. Third, we demonstrate that existing image-to-image translation algorithms can recover clean GS videos from distorted RSGR inputs, yet our algorithm achieves the best performance through its specific designs. Our restored results are not only visually appealing but also beneficial to downstream tasks. Compared with state-of-the-art RS solutions, our RSGR solution is superior in both effectiveness and efficiency. Considering that it is easy to realize without changing the hardware, we believe our RSGR solution can potentially replace RS solutions in capturing distortion-free videos with low noise and at low budget.
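The geometric distortion arises because rows are exposed sequentially, so later rows see the scene shifted further under camera motion. A toy model of how a vertical edge becomes slanted (illustrative only; the paper's RSGR readout turns removing such distortion into a deblur-like problem):

```python
def rs_row_displacement(num_rows, velocity, row_readout_time):
    """Horizontal displacement of each image row of a vertical edge when
    the camera moves at constant `velocity` during rolling-shutter readout.

    Row r is captured at time r * row_readout_time, so its content has
    shifted by velocity * r * row_readout_time relative to row 0, which
    is exactly the slanted-line ("jello") artifact of rolling shutter.
    """
    return [velocity * row * row_readout_time for row in range(num_rows)]
```

A global shutter corresponds to `row_readout_time == 0`, where every row's displacement is zero.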
The potential impact of autonomous robots on everyday life is evident in emerging applications such as precision agriculture, search and rescue, and infrastructure inspection. However, these applications require wide-ranging capabilities in unstructured environments with an unspecified and complex set of objectives, all under strict computation and power limitations. We therefore argue that the computational kernels enabling robotic autonomy must be scheduled and optimized to guarantee timely and correct behavior, while allowing the reconfiguration of scheduling parameters at runtime. In this paper, we take a necessary first step towards the goal of computation-aware autonomous robots: an empirical study of the underlying computational kernels from a resource-management perspective. Specifically, we conduct a data-driven study of the timing, power, and memory performance of kernels for localization and mapping, path planning, task allocation, depth estimation, and optical flow across three embedded computing platforms. We profile and analyze these kernels to provide insight into scheduling and dynamic resource management for computation-aware autonomous robots. Notably, our results show that kernel performance is related to the robot's operating environment, justifying the notion of computation-aware robots and why our work is a crucial step towards this goal.
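A minimal, timing-only sketch of how such a kernel study might collect statistics, assuming the kernel is a plain callable (the paper's measurements additionally cover power and memory on embedded platforms, which this CPU-only stand-in does not attempt):

```python
import statistics
import time

def profile_kernel(kernel, args=(), repeats=20):
    """Measure wall-clock timing statistics of a computational kernel.

    Returns mean, standard deviation, and the worst observed time;
    the worst case is the quantity a real-time scheduler would budget for.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),
        "worst_s": max(samples),  # relevant for real-time scheduling
    }
```

Running this over kernels with inputs drawn from different operating environments would expose the environment-dependent performance the paper reports.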
Neural radiance fields (NeRFs) produce state-of-the-art view synthesis results. However, they are slow to render, requiring hundreds of network evaluations per pixel to approximate a volume rendering integral. Baking NeRFs into explicit data structures enables efficient rendering but results in a large increase in memory footprint and, in many cases, a reduction in quality. In this paper, we propose a new neural light field representation that, in contrast, is compact and directly predicts integrated radiance along rays. Our method supports rendering with a single network evaluation per pixel for small-baseline light field datasets and can also be applied to larger-baseline scenes with only a few evaluations per pixel. At the core of our approach is a ray-space embedding network that maps the 4D ray-space manifold into an intermediate, interpolable latent space. Our method achieves state-of-the-art quality on dense forward-facing datasets such as the Stanford Light Field dataset. In addition, for forward-facing scenes with sparse inputs, we achieve results that are competitive with NeRF-based approaches in terms of quality while providing a better speed/quality/memory trade-off with far fewer network evaluations.
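One common way to obtain the 4D ray coordinate such a light field network consumes is the two-plane parameterization. A small sketch (the plane depths `z_uv` and `z_st` are illustrative defaults):

```python
def two_plane_coords(origin, direction, z_uv=0.0, z_st=1.0):
    """Two-plane parameterization of a ray.

    The intersections (u, v) and (s, t) of the ray with two parallel
    planes form a 4D coordinate; a light field network maps it (via a
    learned ray-space embedding) to the integrated radiance of the whole
    ray in a single evaluation, instead of NeRF's many samples per ray.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    t0 = (z_uv - oz) / dz  # parameter where the ray hits the uv-plane
    t1 = (z_st - oz) / dz  # parameter where the ray hits the st-plane
    return (ox + t0 * dx, oy + t0 * dy, ox + t1 * dx, oy + t1 * dy)
```

A ray perpendicular to both planes maps to identical (u, v) and (s, t) pairs; oblique rays separate the two intersection points.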
Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
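A minimal sketch of the distance-based classification at the heart of several of the compared methods: assign a query to the class whose support-feature centroid is nearest. This is illustrative only, not the paper's exact baseline (which fine-tunes a classifier on the support set):

```python
def nearest_centroid_label(support, query):
    """Classify a query feature by its nearest class centroid.

    `support` maps class label -> list of feature vectors (the few
    labeled shots per class); `query` is a single feature vector.
    """
    def centroid(vectors):
        dim = len(vectors[0])
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = {label: centroid(vecs) for label, vecs in support.items()}
    return min(centroids, key=lambda label: sq_dist(centroids[label], query))
```

With a deep backbone producing the features, the paper's finding is that such simple classifiers close much of the gap to elaborate meta-learning algorithms.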
We present an unsupervised representation learning approach using videos without semantic labels. We leverage temporal coherence as a supervisory signal by formulating representation learning as a sequence sorting task. We take temporally shuffled frames (i.e., in non-chronological order) as inputs and train a convolutional neural network to sort the shuffled sequences. Similar to comparison-based sorting algorithms, we propose to extract features from all frame pairs and aggregate them to predict the correct order. As sorting a shuffled image sequence requires an understanding of the statistical temporal structure of images, training with such a proxy task allows us to learn rich and generalizable visual representations. We validate the effectiveness of the learned representation by using our method as pre-training for high-level recognition problems. The experimental results show that our method compares favorably against state-of-the-art methods on action recognition, image classification, and object detection tasks.
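A hypothetical data-preparation helper for this pretext task: shuffle a clip's frames and record which permutation was applied, which becomes the classification target (the network itself, which aggregates pairwise features to predict the label, is not sketched here):

```python
import itertools
import random

def make_sorting_example(frames, seed=0):
    """Build one training example for the sequence-sorting pretext task.

    Returns a temporally shuffled copy of `frames` together with the index
    of the applied permutation among all permutations of the clip length;
    the network is trained to predict this index from the shuffled frames.
    """
    rng = random.Random(seed)
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]
    all_perms = list(itertools.permutations(range(len(frames))))
    label = all_perms.index(tuple(order))
    return shuffled, label
```

For 4-frame tuples this gives a 24-way classification problem; no human annotation is needed, since the label comes from the video's own temporal order.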
Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. In this paper, we propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, our model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. Our method does not require bicubic interpolation as a pre-processing step and thus dramatically reduces the computational complexity. We train the proposed LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. Furthermore, our network generates multi-scale predictions in one feed-forward pass through progressive reconstruction, thereby facilitating resource-aware applications. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art methods in terms of speed and accuracy.
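A 1-D toy sketch of the progressive, residual reconstruction scheme: each level upsamples 2x and adds a predicted high-frequency residual, yielding one output per scale in a single pass. Here linear interpolation stands in for the transposed convolutions and a caller-supplied `predict_residual` for the learned sub-band prediction:

```python
def progressive_reconstruct(lr_signal, levels, predict_residual):
    """Laplacian-pyramid style progressive super-resolution (1-D toy).

    Returns one reconstruction per pyramid level, mirroring LapSRN's
    multi-scale predictions from a single feed-forward pass.
    """
    outputs = []
    current = list(lr_signal)
    for _ in range(levels):
        upsampled = []
        # 2x upsampling: keep each sample and insert its midpoint with the
        # next sample (edge replicated at the boundary).
        for a, b in zip(current, current[1:] + current[-1:]):
            upsampled.extend([a, (a + b) / 2.0])
        residual = predict_residual(upsampled)  # learned high-frequency detail
        current = [u + r for u, r in zip(upsampled, residual)]
        outputs.append(current)
    return outputs
```

A resource-aware application can stop at any intermediate level and still obtain a valid (coarser) reconstruction.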
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though instantiations share the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models while trading off model accuracy and efficiency well.
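A deliberately simplified 1-D illustration of the idea the iRMB combines: mix each token with (a) a small local neighborhood, standing in for CNN-like short-distance modeling, and (b) the global average, standing in for attention-like long-distance modeling, with a residual connection. This is illustrative only; the real block uses channel expansion, depth-wise convolution, and windowed multi-head self-attention:

```python
def irmb_toy(tokens, kernel=3):
    """Toy short-plus-long-distance mixing on a 1-D list of token values.

    Each output is the input token (residual) plus a blend of its local
    neighborhood mean and the sequence-wide mean.
    """
    n = len(tokens)
    half = kernel // 2
    global_mean = sum(tokens) / n  # long-distance context
    out = []
    for i, x in enumerate(tokens):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        local_mean = sum(tokens[lo:hi]) / (hi - lo)  # short-distance context
        out.append(x + 0.5 * local_mean + 0.5 * global_mean)  # residual fusion
    return out
```

Stacking such blocks across four stages with downsampling in between would mirror EMO's ResNet-like 4-phase layout.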